Bug 1425516
Summary: | Instance stuck resuming from suspend state during load test
---|---
Product: | Red Hat Enterprise Linux 7
Component: | qemu-kvm-rhev
Version: | 7.3
Hardware: | x86_64
OS: | Linux
Status: | CLOSED NEXTRELEASE
Severity: | high
Priority: | unspecified
Target Milestone: | rc
Target Release: | 7.4
Reporter: | Yuri Obshansky <yobshans>
Assignee: | Dr. David Alan Gilbert <dgilbert>
QA Contact: | Prasanth Anbalagan <panbalag>
CC: | berrange, dasmith, dgilbert, eglynn, hhuang, kchamart, knoel, nlevinki, pbonzini, rbryant, rcernin, rkharwar, sbauza, sferdjao, sgordon, srevivo, virt-maint, vromanso, yobshans
Type: | Bug
Last Closed: | 2017-05-31 15:11:07 UTC
Description
Yuri Obshansky
2017-02-21 15:50:57 UTC
Created attachment 1256186 [details]
nova-compute log
Created attachment 1256188 [details]
Horizon dashboard screenshot
Found an error in /var/log/libvirt/qemu/instance-000006d1.log (attached to the bug) on the compute node where the instance was stuck:

KVM internal error. Suberror: 1
emulation failure
EAX=000000b5 EBX=00007a00 ECX=00005678 EDX=00000000
ESI=00000000 EDI=0000a45d EBP=000de800 ESP=0000fc2c
EIP=00008000 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 ffffffff 00809300
CS =a000 000a0000 ffffffff 00809300
SS =0000 00000000 ffffffff 00809300
DS =0000 00000000 ffffffff 00809300
FS =0000 00000000 ffffffff 00809300
GS =0000 00000000 ffffffff 00809300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT=     000f79b0 00000037
IDT=     00000000 00000000
CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff <ff> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

Created attachment 1257486 [details]
qemu log
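For reference, affected instances on a compute node can be spotted by searching the per-instance QEMU logs for this error string; a minimal sketch, using the log path quoted in the description:

    # List per-instance QEMU logs on a compute node that contain the KVM internal error
    grep -l "KVM internal error" /var/log/libvirt/qemu/instance-*.log

    # Show the surrounding register dump for one affected instance
    grep -A 30 "KVM internal error" /var/log/libvirt/qemu/instance-000006d1.log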
Was able to encounter this bug even in non-load situations quite easily (albeit in a virtualized environment). Is there any way to work around this to recover? Thank you

What was the exact sequence of actions that happened to the failed VM prior to the failure? If you happen to have a screen capture of the host prior to the suspend, that would be great. That certainly looks like the CPU state is toast.

(In reply to Ruchika K from comment #5)
> Was able to encounter this bug even in non-load situations quite easily
> (albeit in a virtualized environment)

Please state exactly how you triggered this and what makes you think it's the same bug? Does your log have the KVM error in it?

Dave

> Is there any way to work around this to recover?
> Thank you

(In reply to Dr. David Alan Gilbert from comment #6)
> What was the exact sequence of actions that happened to the failed VM prior
> to the failure?

The load test flow is simple:
- boot instance
- pause instance
- unpause instance
- suspend instance
- resume instance
......etc
- delete instance

It ran in a cycle, and failed on a later iteration, not the first. The test simulated a load of 20 virtual users (threads) with different tenants (not admin).

> If you happen to have a screen capture of the host prior to
> the suspend that would be great.

Unfortunately, no.

> That certainly looks like the CPU state is toast.

The same load test was executed successfully on the same hardware using OSP 8 and 9.

(In reply to Yuri Obshansky from comment #8)
> The load test flow is simple:
> - boot instance
> [...]
> - delete instance

Would it be possible for you to boil this test down into one that can be run without the rest of OpenStack; something just using virsh would be ideal.

Dave

(In reply to Dr. David Alan Gilbert from comment #9)
> Would it be possible for you to boil this test down into one that can be run
> without the rest of openstack; something just using virsh would be ideal.
>
> Dave

Sorry, unfortunately I can't try anything right now; I'm without any hardware, waiting for servers from the scale lab. Let's ask Ruchika K (rkharwar) to reproduce with virsh, if that is possible. Thank you.

Hi Yuri,

Can you retest this with the latest bleeding-edge seabios ROM please; 1.10.2-2 includes a fix that covers at least one known memory-corruption-during-reboot bug, and there's a chance that it's the one you're hitting.

Dave

From IRC:

[danpb] pause / unpause in OpenStack terminology maps to the 'suspend' / 'resume' commands in `virsh`; suspend + resume in OpenStack terminology maps to 'managedsave' + 'start' in virsh.

Based on the above, the reproducer at the libvirt level would be:

    $ virsh {start, suspend, resume, managedsave, start}

Also enable libvirt log filters to capture the libvirt <-> QEMU interactions, to see what commands libvirt is sending to QEMU.
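A minimal sketch of that libvirt-level reproducer, assuming a domain named instance-000006d1 (the name from the QEMU log above); the log-filter values, sleep time and iteration count are illustrative choices, not taken from this bug:

    # /etc/libvirt/libvirtd.conf - capture libvirt <-> QEMU monitor traffic
    # (restart libvirtd after changing these)
    #   log_filters="1:qemu 1:libvirt"
    #   log_outputs="1:file:/var/log/libvirt/libvirtd.log"

    DOM=instance-000006d1     # domain name from the log above; adjust as needed

    for i in $(seq 1 50); do
        virsh start "$DOM"
        sleep 30                      # give the guest time to get past the BIOS
        virsh suspend "$DOM"          # OpenStack "pause"
        virsh resume "$DOM"           # OpenStack "unpause"
        virsh managedsave "$DOM"      # OpenStack "suspend"
        virsh start "$DOM"            # OpenStack "resume" (restores the managed save)
        # stop if the KVM internal error shows up in the per-instance QEMU log
        grep -q "KVM internal error" "/var/log/libvirt/qemu/${DOM}.log" && { echo "hit on iteration $i"; break; }
        virsh destroy "$DOM"          # force off before the next iteration
    done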
Observations:

a) The set of CPU flags in the qemu log is unusual - what host CPUs are you using?

b) I suspect the 'boot instance' isn't waiting very long, so that the managedsave is happening while the guest is still in the BIOS; but not sure.

c) Chatting to danpb and kashyap, can we just confirm that this test is:

loop {
  boot
  pause
  unpause
  suspend
  resume
  delete
}

(In reply to Dr. David Alan Gilbert from comment #13)
> a) The set of CPU flags in the qemu log is unusual - what host CPUs are
> you using?

This is the server: a Dell PowerEdge R620 with 24 CPUs x Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz.

> b) I suspect the 'boot instance' isn't waiting very long, so that the
> managedsave is happening while the guest is still in the BIOS; but not sure.

No, the test waits until the instance is in the UP state and then continues the flow.

> c) Chatting to danpb and kashyap, can we just confirm that this test is:
> loop {
>   boot
>   pause
>   unpause
>   suspend
>   resume
>   delete
> }

The full flow is:

loop {
  01.NOVA.GET.Images
  02.NOVA.GET.Flavors
  03.NEUTRON.POST.Create.Network
  04.NEUTRON.POST.Create.Subnet
  05.NOVA.POST.Boot.Server
  00.NOVA.GET.Server.Details
  06.NOVA.POST.Pause.Server
  07.NOVA.POST.Unpause.Server
  08.NOVA.POST.Suspend.Server
  09.NOVA.POST.Resume.Server
  10.NOVA.POST.Soft.Reboot.Server
  11.NOVA.POST.Hard.Reboot.Server
  12.NOVA.POST.Stop.Server
  13.NOVA.POST.Start.Server
  14.NOVA.POST.Create.Image
  15.NOVA.GET.Image.Id
  16.NOVA.DELETE.Image
  17.NOVA.DELETE.Server
  18.NEUTRON.DELETE.Network
  19.NOVA.GET.Server.Id
}

You can find more details at https://polarion.engineering.redhat.com/polarion/#/project/RHELOpenStackPlatform/wiki/Performance%20_%20Scale/RHOS%20Performance%20Test%20Plan
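For reference, the core of that flow can be approximated from a shell with the OpenStack client; this is only a rough sketch (the image, flavor and network names are hypothetical placeholders, and the image-snapshot and network create/delete steps are omitted):

    # Hypothetical names; substitute whatever exists in the test tenant
    IMAGE=cirros
    FLAVOR=m1.tiny
    NET_ID=$(openstack network show -f value -c id test-net)

    for i in $(seq 1 20); do
        VM="load-vm-$i"
        openstack server create --image "$IMAGE" --flavor "$FLAVOR" \
            --nic net-id="$NET_ID" --wait "$VM"
        openstack server pause "$VM"
        openstack server unpause "$VM"
        openstack server suspend "$VM"
        openstack server resume "$VM"
        openstack server reboot "$VM"          # soft reboot
        openstack server reboot --hard "$VM"   # hard reboot
        openstack server stop "$VM"
        openstack server start "$VM"
        openstack server delete "$VM"
    done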
(In reply to Yuri Obshansky from comment #14)
> > a) The set of CPU flags in the qemu log is unusual - what host CPUs are
> > you using?
>
> This is the server: a Dell PowerEdge R620 with 24 CPUs x Intel(R) Xeon(R)
> CPU E5-2620 0 @ 2.00GHz.
>
> > b) I suspect the 'boot instance' isn't waiting very long, so that the
> > managedsave is happening while the guest is still in the BIOS; but not sure.
>
> No, the test waits until the instance is in the UP state and then continues the flow.

OK, that's more worrying if it's while the guest is running.

> The full flow is:
> loop {
>   01.NOVA.GET.Images
>   [...]
>   19.NOVA.GET.Server.Id
> }

Thanks.

> You can find more details at
> https://polarion.engineering.redhat.com/polarion/#/project/RHELOpenStackPlatform/wiki/Performance%20_%20Scale/RHOS%20Performance%20Test%20Plan

I did look at that; it didn't make that much sense to me - but then again I just think at the qemu level.

Can I just confirm, this is running native on the host - no nesting or anything?

(In reply to Dr. David Alan Gilbert from comment #15)
> I did look at that; it didn't make that much sense to me - but then again I
> just think at the qemu level.

I think you are right.

> Can I just confirm, this is running native on the host - no nesting or
> anything?

Yes, OpenStack was deployed on bare-metal servers (no VMs).

Is there any chance you can test on our current 7.4 world (kernel/bios/seabios)? We've got a bunch of BIOS and other fixes around reboot that we know have fixed a few hangs and crashes, so it's certainly worth a try.

(In reply to Dr. David Alan Gilbert from comment #17)
> Is there any chance you can test on our current 7.4 world
> (kernel/bios/seabios)? We've got a bunch of BIOS and other fixes around
> reboot that we know have fixed a few hangs and crashes, so it's certainly
> worth a try.

Hi, do you mean RHEL 7.4? AFAIK, OpenStack supports only 7.3. Let me know what you suggest; I will gladly do it.

Yuri

Hi,
Dave and I checked whether this bug is reproducible on RHEL 7.4. The bug did not reproduce when the compute node had the new RHEL 7.4 packages installed:

kernel-3.10.0-663.el7.x86_64
qemu-kvm-rhev-2.9.0-3.el7.x86_64
seabios-bin-1.10.2-2.el7.noarch
seavgabios-bin-1.10.2-2.el7.noarch

No instance got stuck on the resume action. The performance test result is here:
http://yobshans.rdu.openstack.engineering.redhat.com/rhos-jmeter/result/2017-05-09-rhos-10-test-baseline-20x50-rhel-7.4/result.html

I attached 2 files:
- part of the nova log (instance-a055bcc7-3097-4d8b-9883-526319d3ec00.txt)
- qemu log (instance-000010b5.log)

Now, the questions are:
- The issue is fixed in future releases, but what about RHOS 10?
- I'm not sure, but I believe it will reproduce in RHOS 11 as well?
- Do we have any workaround?

Thank you
Yuri

Created attachment 1277444 [details]
part of nova log
Created attachment 1277445 [details]
qemu log
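As a sanity check, the package set used in that retest can be confirmed directly on the compute node (a trivial query; the expected output is the NVR list from the comment above):

    # Confirm the versions actually installed on the compute node
    rpm -q kernel qemu-kvm-rhev seabios-bin seavgabios-bin
    # Note: guests that are already running keep the QEMU binary and BIOS ROM
    # they were started with; only freshly started guests pick up new packages.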
(In reply to Yuri Obshansky from comment #22)
> [...]
> Now, the questions are:
> - The issue is fixed in future releases, but what about RHOS 10?
> - I'm not sure, but I believe it will reproduce in RHOS 11 as well?
> - Do we have any workaround?

I don't know the timing for RHOS 11, so I am not sure how it ties up with our 7.4 releases.

Workarounds: well, the problem is we don't actually know what fixed it! So I guess the next step would be to try reverting components to the 7.3 versions and seeing which one retriggers the bug. I suggest starting by reverting seabios and seavgabios and retesting. If that works, revert qemu-kvm-rhev; if that still works, then revert the kernel and we should be back to where we were! (You may need to look at what other dependencies were brought in during those updates, but it's most likely one of those 4 packages.)

The next step after that would be to run a kvm trace during your test and capture more details of the internal error.

Since we know it's one of seabios/kernel/qemu, I've flipped the component to qemu-kvm-rhev.

Dave

EAX=0000a0b5 EBX=ffffffff ECX=0002ffff EDX=000a0000
ESI=ffffffff EDI=ffffffff EBP=ffffffff ESP=000a8000
EIP=ffffffff EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT=     000f79b0 00000037
IDT=     000f79ee 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=5b 66 5e 66 c3 ea 5b e0 00 f0 30 36 2f 32 33 2f 39 39 00 fc <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

That was from the two cases I saw yesterday, which are a bit different from the one above.

bonzini points out it may be SMM related, given the 'CS =a000 000a0000' in the original dump; and I'm suspicious the EDX and ESP in this dump point the same way. We did disable SMM in later BIOSes and there's a kernel SMM fix as well; so if it's going away with the 7.4 kernel/qemu/bios then that may well be the reason.

Please test with just the BIOS packages reverted and let us know whether this stays fixed. Once we know whether that does it, I think we should mark it as fixed, and if OpenStack wants it they can ask for it to be Z-streamed. (Hopefully the easiest fix would be disabling SMM in the BIOS.)
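Putting those suggestions together, a rough outline of the bisection on a compute node might look like this (the NVRs are the 7.3 builds named later in this bug; the exact downgrade mechanics depend on which repositories are enabled, and trace-cmd is just one way to record the kvm trace events):

    # 1) Revert only the BIOS packages to the RHEL 7.3 builds, then re-run the load test
    #    (guests pick up the downgraded BIOS only when they are started fresh)
    yum downgrade seabios-bin-1.9.1-5.el7_3.2 seavgabios-bin-1.9.1-5.el7_3.2

    # 2) If the bug stays away, downgrade qemu-kvm-rhev next; if it still stays away,
    #    downgrade the kernel, until the 7.3 component that retriggers the bug is found.

    # 3) While reproducing, record the kvm trace events for more detail on the internal error
    trace-cmd record -e kvm -o /tmp/kvm-trace.dat
    # ... run the test, stop trace-cmd with Ctrl-C, then inspect with:
    trace-cmd report /tmp/kvm-trace.dat | less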
I'm without hardware again; I'll re-test when I receive the servers. Sorry for the delay.

Yuri

Hi,
I updated the packages from

seabios-bin-1.9.1-5.el7_3.2.noarch
seavgabios-bin-1.9.1-5.el7_3.2.noarch

to

seabios-bin-1.10.2-3.el7.noarch
seavgabios-bin-1.10.2-3.el7.noarch

and re-ran the test. No stuck instances were detected during the load test. Test result -> http://yobshans.rdu.openstack.engineering.redhat.com/rhos-jmeter/result/2017-05-30-rhos-10-restapi-perf-test-20x50/

It looks like the issue is fixed in the new (RHEL 7.4) seabios packages. What are the next steps?

Yuri

Given that:
a) updating seabios to 7.4's seabios fixes it, and
b) the errors are consistent with an SMM error, and we disabled SMM in 7.4's seabios,

I'm marking as closed->nextrelease. Please ask if you want a backport.