1572447 – KVM: entry failed, hardware error 0xffffffff when booting rhel 7.5 guest in valgrind environment


Bug 1572447 - KVM: entry failed, hardware error 0xffffffff when booting rhel 7.5 guest in valgrind environment


Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	valgrind
Sub Component:
Version:	7.7
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	rc
Target Release:	7.9
Assignee:	Mark Wielaard
QA Contact:	qe-baseos-tools-bugs
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1572446 (view as bug list)
Depends On:
Blocks:

Reported:	2018-04-27 03:20 UTC by Guo, Zhiyi
Modified:	2019-12-11 19:29 UTC (History)
CC List:	14 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-12-11 19:29:40 UTC
Target Upstream Version:
Embargoed:


Description Guo, Zhiyi 2018-04-27 03:20:39 UTC

Description of problem:
KVM: entry failed, hardware error 0xffffffff when booting rhel 7.5 guest in valgrind environment

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.10.0-21.el7.x86_64
3.10.0-862.el7.x86_64
valgrind-3.13.0-10.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Boot the guest with the following command line:
valgrind /usr/libexec/qemu-kvm -name nice -m 4G \
        -cpu Opteron_G5,enforce \
        -smp 4,cores=2 \
        -monitor stdio \
        -qmp unix:/tmp/qmp,server,nowait \
        -device cirrus-vga,vgamem_mb=4 \
        -serial unix:/tmp/console,server,nowait \
        -uuid 115e11b2-a869-41b5-91cd-6a32a907be7e \
        -drive file=rhel75.qcow2,if=none,id=drive-scsi-disk0,format=qcow2,cache=none,werror=stop,rerror=stop -device ide-hd,drive=drive-scsi-disk0,id=scsi-disk0 \
        -vnc :0 \
        -cdrom RHEL-7.5-20180322.0-Server-x86_64-dvd1.iso


Actual results:
KVM: entry failed:
KVM: entry failed, hardware error 0xffffffff
RAX=0000000000000005 RBX=0000000000000000 RCX=0000000000000003 RDX=ffff8a65ffd80000
RSI=0000000000000048 RDI=0000000000000003 RBP=ffff8a64f547f960 RSP=ffff8a64f547f958
R8 =ffff8a65fff3a280 R9 =0000000000000100 R10=0000000000000100 R11=000142fd00c0aba0
R12=0000000000000282 R13=0000000000000107 R14=0000000000000000 R15=ffff8a65f96c4000
RIP=ffffffffae267ae7 RFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00800000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 ffffffff 00800000
FS =0000 00007f76b0c408c0 ffffffff 00800000
GS =0000 ffff8a65ffc00000 ffffffff 00800000
LDT=0000 0000000000000000 0000ffff 00000000
TR =0040 ffff8a65ffc04000 00002087 00008b00 DPL=0 TSS64-busy
GDT=     ffff8a65ffc0c000 0000007f
IDT=     ffffffffff528000 00000fff
CR0=8005003b CR2=00005617250df000 CR3=0000000035dba000 CR4=000406f0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=fd a0 53 f3 ae 48 89 e5 53 31 db 0f b7 0c 10 b8 05 00 00 00 <0f> 01 c1 5b 5d c3 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec


Expected results:
Guest can boot without problem

Additional info:
So far I can reproduce this issue only on some AMD hosts, such as:
Model name:            AMD Opteron(tm) X3421 APU
Model name:            AMD Ryzen 5 PRO 1500 Quad-Core Processor

I cannot reproduce this issue on an AMD EPYC host.

The issue does not occur when qemu-kvm runs without valgrind.

Dmesg log:
[216539.375264] ------------[ cut here ]------------
[216539.375298] WARNING: CPU: 0 PID: 1498 at arch/x86/kvm/emulate.c:5653 x86_emulate_insn+0x38b/0xd00 [kvm]
[216539.375300] Modules linked in: vfat fat amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg joydev pcspkr k10temp fam15h_power i2c_piix4 shpchp pinctrl_amd video i2c_designware_platform i2c_designware_core acpi_cpufreq ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic amdkfd amd_iommu_v2 amdgpu i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci ttm libahci drm libata crct10dif_pclmul crct10dif_common tg3 crc32c_intel ptp pps_core i2c_hid i2c_core dm_mirror dm_region_hash dm_log dm_mod
[216539.375339] CPU: 0 PID: 1498 Comm: memcheck-amd64- Kdump: loaded Not tainted 3.10.0-862.el7.x86_64 #1
[216539.375340] Hardware name: HPE ProLiant MicroServer Gen10/ProLiant MicroServer Gen10, BIOS 5.12 03/02/2018
[216539.375342] Call Trace:
[216539.375350]  [<ffffffffb050d768>] dump_stack+0x19/0x1b
[216539.375355]  [<ffffffffafe916d8>] __warn+0xd8/0x100
[216539.375357]  [<ffffffffafe9181d>] warn_slowpath_null+0x1d/0x20
[216539.375370]  [<ffffffffc081854b>] x86_emulate_insn+0x38b/0xd00 [kvm]
[216539.375381]  [<ffffffffc07fa15d>] x86_emulate_instruction+0x1cd/0x700 [kvm]
[216539.375386]  [<ffffffffc0951e6e>] ud_interception+0x1e/0x40 [kvm_amd]
[216539.375389]  [<ffffffffc0957a54>] handle_exit+0x224/0xab0 [kvm_amd]
[216539.375400]  [<ffffffffc07f0ccc>] ? kvm_set_cr8+0x1c/0x20 [kvm]
[216539.375403]  [<ffffffffc0952f3a>] ? svm_vcpu_run+0x37a/0x610 [kvm_amd]
[216539.375413]  [<ffffffffc07f671d>] vcpu_enter_guest+0x64d/0x12c0 [kvm]
[216539.375425]  [<ffffffffc07fde58>] kvm_arch_vcpu_ioctl_run+0x358/0x480 [kvm]
[216539.375434]  [<ffffffffc07e3441>] kvm_vcpu_ioctl+0x2b1/0x650 [kvm]
[216539.375439]  [<ffffffffb002fb90>] do_vfs_ioctl+0x350/0x560
[216539.375442]  [<ffffffffafea5b0b>] ? recalc_sigpending+0x1b/0x70
[216539.375446]  [<ffffffffb00d82bf>] ? file_has_perm+0x9f/0xb0
[216539.375448]  [<ffffffffb002fe41>] SyS_ioctl+0xa1/0xc0
[216539.375452]  [<ffffffffb051f7d5>] system_call_fastpath+0x1c/0x21
[216539.375454] ---[ end trace 7758fa556d251181 ]---
[216539.375456] ------------[ cut here ]------------
[216539.375466] WARNING: CPU: 0 PID: 1498 at arch/x86/kvm/x86.c:364 exception_type+0x49/0x50 [kvm]
[216539.375467] Modules linked in: vfat fat amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg joydev pcspkr k10temp fam15h_power i2c_piix4 shpchp pinctrl_amd video i2c_designware_platform i2c_designware_core acpi_cpufreq ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic amdkfd amd_iommu_v2 amdgpu i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci ttm libahci drm libata crct10dif_pclmul crct10dif_common tg3 crc32c_intel ptp pps_core i2c_hid i2c_core dm_mirror dm_region_hash dm_log dm_mod
[216539.375491] CPU: 0 PID: 1498 Comm: memcheck-amd64- Kdump: loaded Tainted: G        W      ------------   3.10.0-862.el7.x86_64 #1
[216539.375493] Hardware name: HPE ProLiant MicroServer Gen10/ProLiant MicroServer Gen10, BIOS 5.12 03/02/2018
[216539.375494] Call Trace:
[216539.375496]  [<ffffffffb050d768>] dump_stack+0x19/0x1b
[216539.375499]  [<ffffffffafe916d8>] __warn+0xd8/0x100
[216539.375501]  [<ffffffffafe9181d>] warn_slowpath_null+0x1d/0x20
[216539.375511]  [<ffffffffc07f01e9>] exception_type+0x49/0x50 [kvm]
[216539.375521]  [<ffffffffc07fa36b>] x86_emulate_instruction+0x3db/0x700 [kvm]
[216539.375525]  [<ffffffffc0951e6e>] ud_interception+0x1e/0x40 [kvm_amd]
[216539.375528]  [<ffffffffc0957a54>] handle_exit+0x224/0xab0 [kvm_amd]
[216539.375538]  [<ffffffffc07f0ccc>] ? kvm_set_cr8+0x1c/0x20 [kvm]
[216539.375540]  [<ffffffffc0952f3a>] ? svm_vcpu_run+0x37a/0x610 [kvm_amd]
[216539.375551]  [<ffffffffc07f671d>] vcpu_enter_guest+0x64d/0x12c0 [kvm]
[216539.375562]  [<ffffffffc07fde58>] kvm_arch_vcpu_ioctl_run+0x358/0x480 [kvm]
[216539.375581]  [<ffffffffc07e3441>] kvm_vcpu_ioctl+0x2b1/0x650 [kvm]
[216539.375584]  [<ffffffffb002fb90>] do_vfs_ioctl+0x350/0x560
[216539.375586]  [<ffffffffafea5b0b>] ? recalc_sigpending+0x1b/0x70
[216539.375588]  [<ffffffffb00d82bf>] ? file_has_perm+0x9f/0xb0
[216539.375591]  [<ffffffffb002fe41>] SyS_ioctl+0xa1/0xc0
[216539.375594]  [<ffffffffb051f7d5>] system_call_fastpath+0x1c/0x21
[216539.375596] ---[ end trace 7758fa556d251182 ]---
[216539.375612] ------------[ cut here ]------------
[216539.375623] WARNING: CPU: 0 PID: 1498 at arch/x86/kvm/x86.c:364 exception_type+0x49/0x50 [kvm]
[216539.375624] Modules linked in: vfat fat amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg joydev pcspkr k10temp fam15h_power i2c_piix4 shpchp pinctrl_amd video i2c_designware_platform i2c_designware_core acpi_cpufreq ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic amdkfd amd_iommu_v2 amdgpu i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci ttm libahci drm libata crct10dif_pclmul crct10dif_common tg3 crc32c_intel ptp pps_core i2c_hid i2c_core dm_mirror dm_region_hash dm_log dm_mod
[216539.375647] CPU: 0 PID: 1498 Comm: memcheck-amd64- Kdump: loaded Tainted: G        W      ------------   3.10.0-862.el7.x86_64 #1
[216539.375649] Hardware name: HPE ProLiant MicroServer Gen10/ProLiant MicroServer Gen10, BIOS 5.12 03/02/2018
[216539.375650] Call Trace:
[216539.375652]  [<ffffffffb050d768>] dump_stack+0x19/0x1b
[216539.375654]  [<ffffffffafe916d8>] __warn+0xd8/0x100
[216539.375657]  [<ffffffffafe9181d>] warn_slowpath_null+0x1d/0x20
[216539.375666]  [<ffffffffc07f01e9>] exception_type+0x49/0x50 [kvm]
[216539.375677]  [<ffffffffc07f6cd8>] vcpu_enter_guest+0xc08/0x12c0 [kvm]
[216539.375688]  [<ffffffffc07fde58>] kvm_arch_vcpu_ioctl_run+0x358/0x480 [kvm]
[216539.375697]  [<ffffffffc07e3441>] kvm_vcpu_ioctl+0x2b1/0x650 [kvm]
[216539.375700]  [<ffffffffb002fb90>] do_vfs_ioctl+0x350/0x560
[216539.375702]  [<ffffffffafea5b0b>] ? recalc_sigpending+0x1b/0x70
[216539.375703]  [<ffffffffb00d82bf>] ? file_has_perm+0x9f/0xb0
[216539.375706]  [<ffffffffb002fe41>] SyS_ioctl+0xa1/0xc0
[216539.375708]  [<ffffffffb051f7d5>] system_call_fastpath+0x1c/0x21
[216539.375710] ---[ end trace 7758fa556d251183 ]---
[216539.375713] SVM: KVM: FAILED VMRUN WITH VMCB:
[216539.375737] SVM: VMCB Control Area:
[216539.375749] SVM: cr_read:            0010
[216539.375763] SVM: cr_write:           0010
[216539.375776] SVM: dr_read:            00ff
[216539.375790] SVM: dr_write:           00ff
[216539.375803] SVM: exceptions:         00060042
[216539.375817] SVM: intercepts:         00002e7fbdc48037
[216539.375834] SVM: pause filter count: 3000
[216539.375847] SVM: iopm_base_pa:       00000003e4a44000
[216539.375876] SVM: msrpm_base_pa:      00000003d8768000
[216539.375893] SVM: tsc_offset:         fffe63858e4b02f2
[216539.375909] SVM: asid:               58
[216539.375922] SVM: tlb_ctl:            0
[216539.375935] SVM: int_ctl:            010f0100
[216539.375949] SVM: int_vector:         00000000
[216539.375963] SVM: int_state:          00000000
[216539.375977] SVM: exit_code:          ffffffff
[216539.375992] SVM: exit_info1:         0000000000000000
[216539.376008] SVM: exit_info2:         0000000000000000
[216539.376025] SVM: exit_int_info:      00000000
[216539.376042] SVM: exit_int_info_err:  00000000
[216539.376057] SVM: nested_ctl:         1
[216539.376071] SVM: nested_cr3:         00000000d78f4000
[216539.376088] SVM: avic_vapic_bar:     0000000000000000
[216539.376106] SVM: event_inj:          800003ff
[216539.376122] SVM: event_inj_err:      00000000
[216539.376138] SVM: virt_ext:           0
[216539.376153] SVM: next_rip:           0000000000000000
[216539.376171] SVM: avic_backing_page:  0000000000000000
[216539.376188] SVM: avic_logical_id:    0000000000000000
[216539.376204] SVM: avic_physical_id:   0000000000000000
[216539.376220] SVM: VMCB State Save Area:
[216539.376233] SVM: es:   s: 0000 a: 0000 l: ffffffff b: 0000000000000000
[216539.376253] SVM: cs:   s: 0010 a: 029b l: ffffffff b: 0000000000000000
[216539.376273] SVM: ss:   s: 0018 a: 0c93 l: ffffffff b: 0000000000000000
[216539.376294] SVM: ds:   s: 0000 a: 0000 l: ffffffff b: 0000000000000000
[216539.376314] SVM: fs:   s: 0000 a: 0000 l: ffffffff b: 00007f76b0c408c0
[216539.376334] SVM: gs:   s: 0000 a: 0000 l: ffffffff b: ffff8a65ffc00000
[216539.376355] SVM: gdtr: s: 0000 a: 0000 l: 0000007f b: ffff8a65ffc0c000
[216539.376376] SVM: ldtr: s: 0000 a: 0000 l: 0000ffff b: 0000000000000000
[216539.376397] SVM: idtr: s: 0000 a: 0000 l: 00000fff b: ffffffffff528000
[216539.376417] SVM: tr:   s: 0040 a: 008b l: 00002087 b: ffff8a65ffc04000
[216539.376437] SVM: cpl:            0                efer:         0000000000001d01
[216539.376816] SVM: cr0:            000000008005003b cr2:          00005617250df000
[216539.377231] SVM: cr3:            0000000035dba000 cr4:          00000000000406f0
[216539.377621] SVM: dr6:            00000000ffff0ff0 dr7:          0000000000000400
[216539.378033] SVM: rip:            ffffffffae267ae7 rflags:       0000000000000046
[216539.378430] SVM: rsp:            ffff8a64f547f958 rax:          0000000000000005
[216539.378821] SVM: star:           0023001000000000 lstar:        ffffffffae91f670
[216539.379235] SVM: cstar:          ffffffffae9237d0 sfmask:       0000000000043700
[216539.379694] SVM: kernel_gs_base: 0000000000000000 sysenter_cs:  0000000000000010
[216539.380126] SVM: sysenter_esp:   0000000000000000 sysenter_eip: 00000000ae923450
[216539.380511] SVM: gpat:           0007050600070106 dbgctl:       0000000000000000
[216539.380901] SVM: br_from:        0000000000000000 br_to:        0000000000000000
[216539.381287] SVM: excp_from:      0000000000000000 excp_to:      0000000000000000

Comment 2 Guo, Zhiyi 2018-04-27 05:22:33 UTC

*** Bug 1572446 has been marked as a duplicate of this bug. ***

Comment 3 Wei Huang (AMD) 2018-05-01 16:00:27 UTC

FYI: One machine I tested didn't fail: AMD Opteron(tm) Processor 6128 HE. This is only one data point. I am borrowing a matching machine for debugging...

Comment 4 Wei Huang (AMD) 2018-05-01 21:07:17 UTC

I can reproduce it on an AMD Opteron(tm) X3421 APU machine. The weird thing is that initial debugging shows the guest VM trying to run the vmcall instruction. This is an Intel instruction; an AMD guest isn't supposed to execute it.

   0:   0f 01 c1                vmcall 
   3:   5b                      pop    rbx
   4:   5d                      pop    rbp

Comment 5 Wei Huang (AMD) 2018-05-02 21:19:42 UTC

The hypercall instruction used by a guest VM is arch-specific: on AMD machines it is VMMCALL, on Intel it is VMCALL. The two instructions have different opcodes. The Linux kernel is supposed to set X86_FEATURE_VMMCALL on AMD machines. Interestingly, I didn't see this happen when qemu-kvm is run under valgrind. I believe this BZ is caused by a change in code path under valgrind.
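
To make the opcode difference concrete, here is a minimal sketch in C. The byte encodings are the documented ones for VMCALL and VMMCALL; the helper names hypercall_name() and hypercall_for_vendor() are hypothetical and exist only for this illustration. The vendor-string check is a simplification of what the kernel does with X86_FEATURE_VMMCALL, not the kernel's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Documented encodings: VMCALL is the Intel VMX hypercall,
 * VMMCALL is the AMD SVM hypercall. */
static const unsigned char VMCALL_BYTES[3]  = { 0x0f, 0x01, 0xc1 };
static const unsigned char VMMCALL_BYTES[3] = { 0x0f, 0x01, 0xd9 };

/* Hypothetical helper: name the hypercall instruction that starts at
 * insn, or return NULL if the bytes are neither VMCALL nor VMMCALL. */
const char *hypercall_name(const unsigned char *insn, size_t len)
{
    assert(insn != NULL);
    if (len < 3)
        return NULL;
    if (memcmp(insn, VMCALL_BYTES, 3) == 0)
        return "vmcall";
    if (memcmp(insn, VMMCALL_BYTES, 3) == 0)
        return "vmmcall";
    return NULL;
}

/* Hypothetical helper: the instruction a guest would pick if it decided
 * purely on the CPUID vendor string (an assumption for illustration). */
const char *hypercall_for_vendor(const char *vendor)
{
    assert(vendor != NULL);
    return strcmp(vendor, "AuthenticAMD") == 0 ? "vmmcall" : "vmcall";
}
```

Feeding the bytes at the faulting RIP from the register dump (0f 01 c1) to hypercall_name() identifies them as vmcall, which matches the disassembly in comment 4.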

Comment 6 Wei Huang (AMD) 2018-05-03 04:16:17 UTC

An update: according to cpuid, the guest VM detects that it is running on an Intel CPU under valgrind. This is why the guest VM skips the AMD code path.

Comment 7 Eduardo Habkost 2018-05-03 14:26:24 UTC

(In reply to Wei Huang from comment #6)
> An update: according to cpuid, the guest VM detects that it is running on
> Intel CPU under valgrind. This is why guest VM skips AMD code path.

Interesting.  This might explain another issue where QEMU reports HLE/RTM as unavailable when under Valgrind (the Valgrind CPU f/m/s may be in the host_tsx_blacklisted() list).

Comment 8 Wei Huang (AMD) 2018-05-03 15:19:52 UTC

(In reply to Eduardo Habkost from comment #7)
> (In reply to Wei Huang from comment #6)
> > An update: according to cpuid, the guest VM detects that it is running on
> > Intel CPU under valgrind. This is why guest VM skips AMD code path.
> 
> Interesting.  This might explain another issue where QEMU reports HLE/RTM as
> unavailable when under Valgrind (the Valgrind CPU f/m/s may be in the
> host_tsx_blacklisted() list).

I saw some AMD-specific detection code in the valgrind source tree, but it is still unclear to me what happened (valgrind's debugging facilities are very poor). If you have any doubt about what CPUID data valgrind provides, you can run "valgrind cpuid -r" to get the raw data.
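
If the cpuid utility is not installed, the vendor string can also be read with a few lines of C. This is a minimal sketch, assuming GCC or Clang on an x86 host (it relies on the compiler's <cpuid.h> helper header); the function names vendor_from_regs() and read_cpuid_vendor() are hypothetical.

```c
#include <assert.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header providing __get_cpuid() */
#endif

/* CPUID leaf 0 returns the 12-byte vendor string in EBX, EDX, ECX
 * (in that order). Copying the registers byte-wise assumes a
 * little-endian host, which is always the case on x86. */
void vendor_from_regs(unsigned ebx, unsigned edx, unsigned ecx, char out[13])
{
    assert(out != NULL);
    memcpy(out, &ebx, 4);
    memcpy(out + 4, &edx, 4);
    memcpy(out + 8, &ecx, 4);
    out[12] = '\0';
}

/* Returns 0 and fills out on success, -1 if CPUID leaf 0 is unavailable
 * (or the program was not built for x86). */
int read_cpuid_vendor(char out[13])
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return -1;
    vendor_from_regs(ebx, edx, ecx, out);
    return 0;
#else
    (void)out;
    return -1;
#endif
}
```

Calling read_cpuid_vendor() from a small main() and running the binary once natively and once under valgrind should show whether the two runs report different vendors on a given host, as comment 6 describes.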

Comment 9 Wei Huang (AMD) 2018-05-03 20:29:39 UTC

Here is the root cause. The following code is from guest_amd64_toIR.c in valgrind. When a cpuid instruction is detected in the program, valgrind takes over and installs a specific CPU type. The CPU type to be installed depends on the host hardware features. The code clearly shows that an AMD CPU is installed only in the final else branch, i.e. when archinfo->hwcaps lacks SSE3 or CX16 (fName="amd64g_dirtyhelper_CPUID_baseline").

Newer AMD hosts obviously have these features, so valgrind installs an Intel CPU (Core-i7, Core-i5, etc.) instead. This causes the guest VM to take the wrong code path and use the "VMCALL" instruction, which caused SVM to fail during the world switch.

      /* This isn't entirely correct, CPUID should depend on the VEX            
         capabilities, not on the underlying CPU. See bug #324882. */
      if ((archinfo->hwcaps & VEX_HWCAPS_AMD64_SSE3) &&
          (archinfo->hwcaps & VEX_HWCAPS_AMD64_CX16) &&
          (archinfo->hwcaps & VEX_HWCAPS_AMD64_AVX2)) {
         fName = "amd64g_dirtyhelper_CPUID_avx2";
         fAddr = &amd64g_dirtyhelper_CPUID_avx2;
         /* This is a Core-i7-4910-like machine */
      }
      else if ((archinfo->hwcaps & VEX_HWCAPS_AMD64_SSE3) &&
               (archinfo->hwcaps & VEX_HWCAPS_AMD64_CX16) &&
               (archinfo->hwcaps & VEX_HWCAPS_AMD64_AVX)) {
         fName = "amd64g_dirtyhelper_CPUID_avx_and_cx16";
         fAddr = &amd64g_dirtyhelper_CPUID_avx_and_cx16;
         /* This is a Core-i5-2300-like machine */
      }
      else if ((archinfo->hwcaps & VEX_HWCAPS_AMD64_SSE3) &&
	       (archinfo->hwcaps & VEX_HWCAPS_AMD64_CX16)) {
         fName = "amd64g_dirtyhelper_CPUID_sse42_and_cx16";
         fAddr = &amd64g_dirtyhelper_CPUID_sse42_and_cx16;
         /* This is a Core-i5-670-like machine */
      }
      else {
         /* Give a CPUID for at least a baseline machine, SSE2                  
            only, and no CX16 */
>>>>     fName = "amd64g_dirtyhelper_CPUID_baseline";
>>>>     fAddr = &amd64g_dirtyhelper_CPUID_baseline;
      }


Personally I think the design of valgrind never took AMD machines into serious consideration. Therefore we shouldn't run it on AMD hosts until valgrind is fixed.
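
The selection logic quoted above can be modeled as a small standalone function to see which helper a given host ends up with. This is only a sketch: the CAP_* bit values and all names below are hypothetical stand-ins (the real VEX_HWCAPS_AMD64_* constants live in the valgrind sources), and helper_reports_amd() simply encodes the observation from this comment that only the baseline helper presents an AMD CPU.

```c
#include <assert.h>

/* Hypothetical stand-ins for the VEX_HWCAPS_AMD64_* bits; the values
 * are made up for this example. */
enum {
    CAP_SSE3 = 1 << 0,
    CAP_CX16 = 1 << 1,
    CAP_AVX  = 1 << 2,
    CAP_AVX2 = 1 << 3
};

enum cpuid_helper {
    HELPER_AVX2,            /* Core-i7-4910-like */
    HELPER_AVX_AND_CX16,    /* Core-i5-2300-like */
    HELPER_SSE42_AND_CX16,  /* Core-i5-670-like */
    HELPER_BASELINE         /* SSE2 only, no CX16 */
};

/* Same branch order as the quoted valgrind code: every non-baseline
 * helper requires both SSE3 and CX16. */
enum cpuid_helper pick_cpuid_helper(unsigned hwcaps)
{
    int sse3_cx16 = (hwcaps & CAP_SSE3) && (hwcaps & CAP_CX16);

    if (sse3_cx16 && (hwcaps & CAP_AVX2))
        return HELPER_AVX2;
    if (sse3_cx16 && (hwcaps & CAP_AVX))
        return HELPER_AVX_AND_CX16;
    if (sse3_cx16)
        return HELPER_SSE42_AND_CX16;
    return HELPER_BASELINE;
}

/* Per the analysis above, only the baseline helper reports an AMD CPU. */
int helper_reports_amd(enum cpuid_helper h)
{
    assert((unsigned)h <= HELPER_BASELINE);
    return h == HELPER_BASELINE;
}
```

With all four bits set, as on a recent AMD host, pick_cpuid_helper() returns HELPER_AVX2, so the program running under valgrind would see an Intel-like CPU regardless of the real vendor.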

Comment 10 Wei Huang (AMD) 2018-05-16 15:38:31 UTC

Given that this issue is related to valgrind and is not a virt problem, I am closing this BZ.

If you really want to run qemu-kvm with valgrind on an AMD host, you can download and compile valgrind yourself. Remember to apply the following _hack_:

diff --git a/VEX/priv/guest_amd64_toIR.c b/VEX/priv/guest_amd64_toIR.c
index f4620306c..14e2b2016 100644
--- a/VEX/priv/guest_amd64_toIR.c
+++ b/VEX/priv/guest_amd64_toIR.c
@@ -22098,6 +22098,9 @@ Long dis_ESC_0F (
          fAddr = &amd64g_dirtyhelper_CPUID_baseline;
       }
 
+      fName = "amd64g_dirtyhelper_CPUID_baseline";
+      fAddr = &amd64g_dirtyhelper_CPUID_baseline;
+
       vassert(fName); vassert(fAddr);
       d = unsafeIRDirty_0_N ( 0/*regparms*/, 
                               fName, fAddr, mkIRExprVec_1(IRExpr_GSPTR()) );

Comment 11 Jeff Nelson 2018-05-17 14:26:05 UTC

Reassigning to valgrind based on comments 9 and 10.

Comment 12 Mark Wielaard 2018-06-10 20:23:18 UTC

The cpuid replacement code in valgrind is indeed not the nicest code there is. But it should expose the cpuid bits that are safe to use when running under valgrind on a particular host cpu. It is somewhat pessimistic, however, and could be more precise. As correctly pointed out above, valgrind only has a handful of cpuid replacement types.

To better understand how to fix this properly upstream, could you tell me, or point me to, the code in qemu that tests for certain cpuid features? Or does it explicitly check the vendor id string?

BTW. It is sometimes helpful to run the user space cpuid utility under valgrind to show what cpu it emulates. e.g. $ valgrind -q cpuid

If you could tell me which feature bits/information qemu-kvm checks, that would be helpful.

Comment 13 Eduardo Habkost 2018-06-11 22:06:30 UTC

(In reply to Mark Wielaard from comment #12)
> If you could tell me which feature bits/information qemu-kvm checks, that
> would be helpful.

grep for 'host_cpuid' in target/i386/cpu.c in the QEMU code.  The full list is:

* host_vendor_fms()
Used to choose the CPUID family/model/stepping seen by the guest when using "-cpu host".  Probably harmless, unless they trigger model-specific quirks on the guest side.

* cpu_x86_fill_model_id()
Used to choose CPUID model_id seen by guest when using "-cpu host".  Looks harmless.

* max_x86_cpu_initfn()
Used to choose CPUID vendor id seen by guest when using "-cpu host".  Can probably trigger this bug if Valgrind CPUID vendor doesn't match the host.

* x86_cpu_load_def()
Used to choose CPUID vendor id seen by guest when using all the remaining CPU models except for "-cpu host".  Likely to be the cause of this bug.

* cpu_x86_cpuid()
Only used to define the CPU cache information seen by the guest when cache_info_passthrough is enabled. cache-info-passthrough should be enabled only in very limited circumstances anyway, so this should be harmless.

* x86_host_phys_bits()
Used to define what the guest will see on CPUID[0x80000008].EAX[bits 0:7].  Probably harmless if set to a reasonable value.
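
As an illustration of the last item, the following sketch reads CPUID[0x80000008].EAX and extracts bits 0:7, the physical address width. It assumes GCC or Clang on x86 (for <cpuid.h>); the function names are hypothetical and this is not QEMU's implementation of x86_host_phys_bits().

```c
#include <assert.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header providing __get_cpuid() */
#endif

/* Bits 0:7 of CPUID[0x80000008].EAX hold the physical address width. */
unsigned phys_bits_from_eax(unsigned eax)
{
    return eax & 0xffu;
}

/* Highest physical address representable with the given width. */
unsigned long long max_phys_addr(unsigned bits)
{
    assert(bits > 0 && bits < 64);
    return (1ULL << bits) - 1;
}

/* Returns the host's physical address width, or -1 if the leaf is not
 * available (or the program was not built for x86). */
int read_host_phys_bits(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000008u, &eax, &ebx, &ecx, &edx))
        return -1;
    return (int)phys_bits_from_eax(eax);
#else
    return -1;
#endif
}
```

Comparing the value returned natively with the value returned under valgrind is one way to check whether this leaf is also replaced by valgrind's CPUID helpers.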

Comment 14 Mark Wielaard 2018-06-18 12:52:40 UTC

Thanks a lot for the overview of all the places the cpuids are used in qemu.
It will take a bit more time to properly analyse and think of the correct way to handle all of this in valgrind. Moving to 7.7 for now since there is probably not enough time for 7.6. Please let me know if this is urgent and I should raise the priority.

BTW, does this work when using an Intel processor based host? If so, what is the valgrind qemu-kvm command line used?

Comment 18 Mark Wielaard 2019-08-08 15:51:14 UTC

No work has been done on this, moving to 7.9.
