Bug 1386239
| Summary: | emulation failure for some of our machines | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Tomas Pelka <tpelka> |
| Component: | seabios | Assignee: | Virtualization Maintenance <virt-maint> |
| Status: | CLOSED DUPLICATE | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 7.3 | CC: | bsd, chayang, hhuang, juzhang, knoel, mtessun, rbalakri, rkrcmar, virt-maint, zhguo |
| Target Milestone: | rc | Keywords: | TestBlocker |
| Target Release: | 7.4 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-06-06 14:41:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1395265, 1401400 | | |
| Attachments: | | | |
Description
Tomas Pelka
2016-10-18 13:19:22 UTC
Created attachment 1212078 [details]
vm.xml
This issue is not critical yet, but it will be soon.
We have a hypervisor with about 60 VMs connected to Beaker. When the majority of them (90%) are scheduled via Beaker, which happens easily during a mass run, I start observing the mentioned issue.
The XML definition for these machines is the same and is attached.
One other symptom, besides the auto pausing (the only way to resume from the paused state is a force reset) of machines with the trace mentioned in c0: I can also see that some machines are stuck at 100% CPU load right after Beaker completes provisioning. They need to be restarted in order to continue with scheduled tasks.
Maybe we are just overloading the hypervisor (150G RAM, RAID5 SATA, 32 CPU cores)?

(In reply to Tomas Pelka from comment #1)
> Created attachment 1212078 [details]
> One other symptom, besides on auto pausing (only way how to resume from
> pause state is force reset) machines with mentioned trace in c0,

Btw. the trace doesn't have a line that starts with code "Code=" and follows with a list of bytes at the bottom?
Something like "Code=00 00 [...] 12 <34> 56 [...] ff ff"

> I can also
> see that some machines are stuck with 100% CPU load right after beaker
> complete provisioning. It need to be restarted in order to continue with
> scheduled tasks.

The emulation error happens when trying to emulate an invalid instruction, but the trace doesn't say what the instruction is, so please run these commands on the host for misbehaving domains:

  virsh qemu-monitor-command $domain --hmp info status
  virsh qemu-monitor-command $domain --hmp info registers
  virsh qemu-monitor-command $domain --hmp 'x /i $pc'
  virsh qemu-monitor-command $domain --hmp 'x /64xb $pc-48'

and share their results.
Thanks.

> Maybe we are just overloading hypervisor (150G RAM, RAID5 SATA, 32 CPU
> cores)?

No, overloading the hypervisor shouldn't cause an emulation error. The guests might misbehave because things don't execute as fast as they'd expect, but we'd like to fix everything else (and expectations of RHEL guests too).
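
The four monitor queries requested above can be gathered for a domain in one go. The following is only a minimal sketch, assuming `virsh` is on the host's PATH and the caller is allowed to talk to libvirt; the helper names `hmp_argv` and `collect` are made up for illustration and are not part of libvirt.

```python
import subprocess

# The four HMP queries requested in the comment above.
HMP_QUERIES = ["info status", "info registers", "x /i $pc", "x /64xb $pc-48"]


def hmp_argv(domain, query):
    """Build the argv for one `virsh qemu-monitor-command --hmp` call."""
    # Passing a list (no shell involved) keeps `$pc` literal for the monitor.
    return ["virsh", "qemu-monitor-command", domain, "--hmp", query]


def collect(domain):
    """Run every query against `domain` and return {query: output text}."""
    results = {}
    for query in HMP_QUERIES:
        proc = subprocess.run(hmp_argv(domain, query),
                              capture_output=True, text=True, check=False)
        # Keep the error text when a query fails so nothing is silently lost.
        results[query] = proc.stdout if proc.returncode == 0 else proc.stderr
    return results
```

Calling `collect("<domain name>")` on the host returns the four outputs keyed by query, ready to be pasted into a comment.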

virsh # qemu-monitor-command qe-dell-ovs5-vm-12 --hmp info status
VM status: paused (internal-error)

virsh # qemu-monitor-command qe-dell-ovs5-vm-12 --hmp info registers
EAX=0000008a EBX=ffd18f73 ECX=0000008a EDX=00000000
ESI=00000003 EDI=7ffb13c8 EBP=00000066 ESP=00006e98
EIP=7ffbcf88 EFL=00010046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
FS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
GS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT= 000f7cb0 00000037
IDT= 000f7cee 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
XMM00=00000000000000000000000000000000 XMM01=00000000000000000000000000000000
XMM02=00000000000000000000000000000000 XMM03=00000000000000000000000000000000
XMM04=00000000000000000000000000000000 XMM05=00000000000000000000000000000000
XMM06=00000000000000000000000000000000 XMM07=00000000000000000000000000000000

virsh # qemu-monitor-command qe-dell-ovs5-vm-12 --hmp 'x /i $pc'
0x000000007ffbcf88: enter $0xe7c6,$0x71

virsh # qemu-monitor-command qe-dell-ovs5-vm-12 --hmp 'x /64xb $pc-48'
000000007ffbcf58: 0x00 0x00 0x00 0x88 0xd9 0xd3 0xe0 0xb9
000000007ffbcf60: 0xe8 0x03 0x00 0x00 0x31 0xd2 0xf7 0xf1
000000007ffbcf68: 0x89 0x44 0x24 0x04 0xc7 0x04 0x24 0x77
000000007ffbcf70: 0x70 0x0f 0x00 0xe8 0xd7 0x68 0x13 0x80
000000007ffbcf78: 0xb0 0x34 0xe6 0x43 0x31 0xc0 0xe6 0x40
000000007ffbcf80: 0xe6 0x40 0xb1 0x8a 0x88 0xc8 0xe6 0x70
000000007ffbcf88: 0xc8 0xc6 0xe7 0x71 0xb0 0x8b 0xe6 0x70
000000007ffbcf90: 0xe4 0x71 0x83 0xe0 0x01 0x83 0xc8 0x02

One more thing: the CPU is a SandyBridge one; I tried to set it manually to Nehalem to see if that helps. The following outputs are with Nehalem set.

(In reply to Radim Krčmář from comment #2)
> (In reply to Tomas Pelka from comment #1)
> > Created attachment 1212078 [details]
> > One other symptom, besides on auto pausing (only way how to resume from
> > pause state is force reset) machines with mentioned trace in c0,
>
> Btw. the trace doesn't have a line that starts with code "Code=" and follows
> with a list of bytes at the bottom?
> Something like "Code=00 00 [...] 12 <34> 56 [...] ff ff"

I can't see that line in /var/log/libvirt/qemu/domain.log, where can I find it?

Also adding qemu-monitor-command outputs for a machine that is running but stuck on reboot, eating 100% CPU:

virsh # qemu-monitor-command qe-dell-ovs5-vm-18 --hmp info status
VM status: running

virsh # qemu-monitor-command qe-dell-ovs5-vm-18 --hmp info registers
EAX=00000000 EBX=ffd18f73 ECX=0000008a EDX=00000000
ESI=00000003 EDI=7ffb13c8 EBP=00000066 ESP=00000003
EIP=7ffbcf16 EFL=00010046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
FS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
GS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT= 000f7cb0 00000037
IDT= 000f7cee 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
FCW=037f FSW=0000 [ST=0] FTW=00 MXCSR=00001f80
FPR0=0000000000000000 0000 FPR1=0000000000000000 0000
FPR2=0000000000000000 0000 FPR3=0000000000000000 0000
FPR4=0000000000000000 0000 FPR5=0000000000000000 0000
FPR6=0000000000000000 0000 FPR7=0000000000000000 0000
XMM00=00000000000000000000000000000000 XMM01=00000000000000000000000000000000
XMM02=00000000000000000000000000000000 XMM03=00000000000000000000000000000000
XMM04=00000000000000000000000000000000 XMM05=00000000000000000000000000000000
XMM06=00000000000000000000000000000000 XMM07=00000000000000000000000000000000

virsh # qemu-monitor-command qe-dell-ovs5-vm-18 --hmp 'x /i $pc'
0x000000007ffbcf16: str -0x7cfebd(%ebp)

virsh # qemu-monitor-command qe-dell-ovs5-vm-18 --hmp 'x /64xb $pc-48'
000000007ffbcee6: 0x2b 0x7c 0x24 0x44 0x1b 0x6c 0x24 0x48
000000007ffbceee: 0x69 0xdd 0x99 0x9e 0x36 0x00 0xb9 0x99
000000007ffbcef6: 0x9e 0x36 0x00 0x89 0xf8 0xf7 0xe1 0x01
000000007ffbcefe: 0xda 0x05 0xff 0x07 0x00 0x00 0x83 0xd2
000000007ffbcf06: 0x00 0x89 0xc6 0x89 0xd7 0x0f 0xac 0xfe
000000007ffbcf0e: 0x0b 0xc1 0xef 0x0b 0x8a 0x1d 0x6e 0x7b
000000007ffbcf16: 0x0f 0x00 0x8d 0x43 0x01 0x83 0xff 0x00
000000007ffbcf1e: 0x76 0x10 0x83 0xc6 0x01 0x83 0xd7 0x00

(In reply to Tomas Pelka from comment #3)
> virsh # qemu-monitor-command qe-dell-ovs5-vm-12 --hmp 'x /i $pc'
> 0x000000007ffbcf88: enter $0xe7c6,$0x71

This domain tried to execute an instruction that is emulated by KVM only if the second argument is 0. KVM can be improved to accept all values, but I'm starting to have doubts about the system -- there is no reason to pass a number bigger than 31 as the second parameter, because the CPU does a modulo 32 on it.

Is the instruction from the original program? i.e. isn't there some memory corruption going on? (and if not, what program is doing that?)

Are internal emulation errors from other guests also stuck on this instruction?
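
The remark about the second operand can be checked against the four bytes at $pc in the vm-12 dump. This is a small sketch, assuming the standard ENTER encoding (opcode 0xC8, a 16-bit little-endian frame size, then an 8-bit nesting level that the CPU reduces modulo 32); the function name `decode_enter` is illustrative only.

```python
def decode_enter(code):
    """Decode an x86 ENTER instruction (C8 iw ib) from raw bytes."""
    if len(code) < 4 or code[0] != 0xC8:
        raise ValueError("not an ENTER instruction")
    frame_size = int.from_bytes(code[1:3], "little")  # imm16 operand
    nesting_raw = code[3]                              # imm8 as encoded
    nesting_effective = nesting_raw % 32               # level the CPU uses
    return frame_size, nesting_raw, nesting_effective


# Bytes at 0x7ffbcf88 from the 'x /64xb $pc-48' output for vm-12.
size, raw, effective = decode_enter(bytes([0xC8, 0xC6, 0xE7, 0x71]))
print(hex(size), hex(raw), effective)
```

The decoded operands match the disassembly `enter $0xe7c6,$0x71`. The encoded nesting level 0x71 is nonzero and larger than 31 (it reduces to 17), which is the case the comment above says KVM does not emulate and which a compiler would have no reason to emit.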

(In reply to Tomas Pelka from comment #6)
> virsh # qemu-monitor-command qe-dell-ovs5-vm-18 --hmp 'x /i $pc'
> 0x000000007ffbcf16: str -0x7cfebd(%ebp)
>
> virsh # qemu-monitor-command qe-dell-ovs5-vm-18 --hmp 'x /64xb $pc-48'
> 000000007ffbcee6: 0x2b 0x7c 0x24 0x44 0x1b 0x6c 0x24 0x48
> 000000007ffbceee: 0x69 0xdd 0x99 0x9e 0x36 0x00 0xb9 0x99
> 000000007ffbcef6: 0x9e 0x36 0x00 0x89 0xf8 0xf7 0xe1 0x01
> 000000007ffbcefe: 0xda 0x05 0xff 0x07 0x00 0x00 0x83 0xd2
> 000000007ffbcf06: 0x00 0x89 0xc6 0x89 0xd7 0x0f 0xac 0xfe
> 000000007ffbcf0e: 0x0b 0xc1 0xef 0x0b 0x8a 0x1d 0x6e 0x7b
> 000000007ffbcf16: 0x0f 0x00 0x8d 0x43 0x01 0x83 0xff 0x00
> 000000007ffbcf1e: 0x76 0x10 0x83 0xc6 0x01 0x83 0xd7 0x00

This also doesn't make much sense ... "0f 00 8d 43 01 83 ff" is "str -0x7cfebd(%ebp)", but if the CPU was executing the code in the block above, then the instruction boundary would be elsewhere, likely forming "8a 1d 6e 7b 0f 00" and "8d 43 01", but never "0f 00 8d 43 01 83 ff" -- the instruction length decoder is a very exposed part of KVM, so a bug there is less likely than corruption or a buggy jump.

Please upload trace.dat after using

  trace-cmd record -e kvm:\* -P $thread_id_of_the_vcpu

to see what the vcpu does while burning CPU. $thread_id_of_the_vcpu can be learned with

  virsh qemu-monitor-command $domain --hmp info cpus

(In reply to Tomas Pelka from comment #4)
> One more thing CPU is SandyBridge one, I tried to set it manually to Nehalem
> to if that helps. Following outputs are with Nehalem set.

So the host CPU is Sandy Bridge? (I assumed Nehalem, because it was in vm.xml.)

Please run this on the host; it will allow us to tell under which conditions KVM is expected to emulate instructions:

  grep . /sys/module/kvm{,_intel}/parameters/*

(In reply to Tomas Pelka from comment #5)
> (In reply to Radim Krčmář from comment #2)
> > (In reply to Tomas Pelka from comment #1)
> > > Created attachment 1212078 [details]
> > > One other symptom, besides on auto pausing (only way how to resume from
> > > pause state is force reset) machines with mentioned trace in c0,
> >
> > Btw. the trace doesn't have a line that starts with code "Code=" and follows
> > with a list of bytes at the bottom?
> > Something like "Code=00 00 [...] 12 <34> 56 [...] ff ff"
>
> I can't see that line in /var/log/libvirt/qemu/domain.log, where can I find
> it?

If it's not right under EFER, then QEMU likely didn't print it ... not that important now, it just complicates debugging.

Hi Zhiyi,
Could you handle this needinfo?
Best Regards,
Junyi

Any luck with reproducing this on another machine? Thanks.

The bug was caused by seabios memory corruption.

*** This bug has been marked as a duplicate of bug 1428347 ***
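
The byte-level argument about the vm-18 dump in the thread above can be reproduced by hand. This is a sketch only, assuming 32-bit protected mode and handling just the one ModRM form that occurs here (mod=10, rm=101, that is, a 32-bit displacement off %ebp); the function name `decode_0f00_ebp_disp32` is hypothetical.

```python
import struct


def decode_0f00_ebp_disp32(code):
    """Decode `0F 00 /r` with ModRM mod=10, rm=101 (disp32 off %ebp)."""
    if len(code) < 7 or code[:2] != b"\x0f\x00":
        raise ValueError("not a complete 0F 00 group instruction")
    modrm = code[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b10 or rm != 0b101:
        raise ValueError("only the disp32(%ebp) form is handled in this sketch")
    (disp,) = struct.unpack_from("<i", code, 3)  # signed little-endian disp32
    return reg, disp


# Eight bytes at 0x7ffbcf16 (the reported $pc) from the vm-18 dump.
at_pc = bytes([0x0F, 0x00, 0x8D, 0x43, 0x01, 0x83, 0xFF, 0x00])
reg, disp = decode_0f00_ebp_disp32(at_pc)
print(reg, hex(disp))  # reg field 1 selects STR within the 0F 00 group

# Decoding from the preceding bytes instead: "8a 1d 6e 7b 0f 00" is a
# 6-byte instruction that starts 4 bytes into the row at 0x7ffbcf0e.
pc = 0x7FFBCF16
mov_start, mov_len = 0x7FFBCF0E + 4, 6
print(mov_start < pc < mov_start + mov_len)  # is $pc inside that instruction?
```

The first line printed is `1 -0x7cfebd`, matching the `str -0x7cfebd(%ebp)` disassembly, and the second is `True`: the reported $pc falls in the middle of the instruction that the surrounding code would normally form, which is consistent with the suspicion of corruption or a stray jump.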