Bug 1425516
Summary: | Instance stuck resuming from suspend state during load test
---|---
Product: | Red Hat Enterprise Linux 7
Component: | qemu-kvm-rhev
Version: | 7.3
Hardware: | x86_64
OS: | Linux
Status: | CLOSED NEXTRELEASE
Severity: | high
Priority: | unspecified
Target Milestone: | rc
Target Release: | 7.4
Reporter: | Yuri Obshansky <yobshans>
Assignee: | Dr. David Alan Gilbert <dgilbert>
QA Contact: | Prasanth Anbalagan <panbalag>
CC: | berrange, dasmith, dgilbert, eglynn, hhuang, kchamart, knoel, nlevinki, pbonzini, rbryant, rcernin, rkharwar, sbauza, sferdjao, sgordon, srevivo, virt-maint, vromanso, yobshans
Type: | Bug
Last Closed: | 2017-05-31 15:11:07 UTC
Description
Yuri Obshansky
2017-02-21 15:50:57 UTC
Created attachment 1256186 [details]
nova-compute log
Created attachment 1256188 [details]
Horizon dashboard screenshot
Found an error in /var/log/libvirt/qemu/instance-000006d1.log (attached to the bug) on the compute node where the instance was stuck:

KVM internal error. Suberror: 1
emulation failure
EAX=000000b5 EBX=00007a00 ECX=00005678 EDX=00000000
ESI=00000000 EDI=0000a45d EBP=000de800 ESP=0000fc2c
EIP=00008000 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 ffffffff 00809300
CS =a000 000a0000 ffffffff 00809300
SS =0000 00000000 ffffffff 00809300
DS =0000 00000000 ffffffff 00809300
FS =0000 00000000 ffffffff 00809300
GS =0000 00000000 ffffffff 00809300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT=     000f79b0 00000037
IDT=     00000000 00000000
CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff <ff> ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

Created attachment 1257486 [details]
qemu log
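For reference, affected instances on a compute node can be spotted by searching the per-instance QEMU logs for this error string; a minimal sketch, using the log path quoted in the description:

    # List per-instance QEMU logs on a compute node that contain the KVM internal error
    grep -l "KVM internal error" /var/log/libvirt/qemu/instance-*.log

    # Show the surrounding register dump for one affected instance
    grep -A 30 "KVM internal error" /var/log/libvirt/qemu/instance-000006d1.log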
Was able to encounter this bug even in non-load situations quite easily (albeit in a virtualized environment). Is there any way to work around this to recover? Thank you

What was the exact sequence of actions that happened to the failed VM prior to the failure? If you happen to have a screen capture of the host prior to the suspend, that would be great. That certainly looks like the CPU state is toast.

(In reply to Ruchika K from comment #5)
> Was able to encounter this bug even in non-load situations quite easily
> (albeit in a virtualized environment)

Please state exactly how you triggered this and what makes you think it's the same bug? Does your log have the KVM error in it?

Dave

> Is there any way to work around this to recover?
> Thank you

(In reply to Dr. David Alan Gilbert from comment #6)
> What was the exact sequence of actions that happened to the failed VM prior
> to the failure?

The load test flow is simple:
- boot instance
- pause instance
- unpause instance
- suspend instance
- resume instance
......etc
- delete instance

It ran in a cycle, and failed on a later iteration, not the first. The test simulated a load of 20 virtual users (threads) with different tenants (not admin).

> If you happen to have a screen capture of the host prior to
> the suspend that would be great.

Unfortunately, no.

> That certainly looks like the CPU state is toast.

The same load test was executed successfully on the same hardware using OSP 8 and 9.

(In reply to Yuri Obshansky from comment #8)
> The load test flow is simple:
> - boot instance
> [...]
> - delete instance

Would it be possible for you to boil this test down into one that can be run without the rest of OpenStack; something just using virsh would be ideal.

Dave

(In reply to Dr. David Alan Gilbert from comment #9)
> Would it be possible for you to boil this test down into one that can be run
> without the rest of openstack; something just using virsh would be ideal.
>
> Dave

Sorry, unfortunately I can't try anything right now; I'm without any hardware, waiting for servers from the scale lab. Let's ask Ruchika K (rkharwar) to reproduce with virsh, if that is possible. Thank you.

Hi Yuri,

Can you retest this with the latest bleeding-edge seabios ROM please; 1.10.2-2 includes a fix that covers at least one known memory-corruption-during-reboot bug, and there's a chance that it's the one you're hitting.

Dave

From IRC:

[danpb] pause / unpause in OpenStack terminology maps to the 'suspend' / 'resume' commands in `virsh`; suspend + resume in OpenStack terminology maps to 'managedsave' + 'start' in virsh.

Based on the above, the reproducer at the libvirt level would be:

    $ virsh {start, suspend, resume, managedsave, start}

Also enable libvirt log filters to capture the libvirt <-> QEMU interactions, to see what commands libvirt is sending to QEMU.
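A minimal sketch of that libvirt-level reproducer, assuming a domain named instance-000006d1 (the name from the QEMU log above); the log-filter values, sleep time and iteration count are illustrative choices, not taken from this bug:

    # /etc/libvirt/libvirtd.conf - capture libvirt <-> QEMU monitor traffic
    # (restart libvirtd after changing these)
    #   log_filters="1:qemu 1:libvirt"
    #   log_outputs="1:file:/var/log/libvirt/libvirtd.log"

    DOM=instance-000006d1     # domain name from the log above; adjust as needed

    for i in $(seq 1 50); do
        virsh start "$DOM"
        sleep 30                      # give the guest time to get past the BIOS
        virsh suspend "$DOM"          # OpenStack "pause"
        virsh resume "$DOM"           # OpenStack "unpause"
        virsh managedsave "$DOM"      # OpenStack "suspend"
        virsh start "$DOM"            # OpenStack "resume" (restores the managed save)
        # stop if the KVM internal error shows up in the per-instance QEMU log
        grep -q "KVM internal error" "/var/log/libvirt/qemu/${DOM}.log" && { echo "hit on iteration $i"; break; }
        virsh destroy "$DOM"          # force off before the next iteration
    done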
Observations:

a) The set of CPU flags in the qemu log is unusual - what host CPUs are you using?

b) I suspect the 'boot instance' isn't waiting very long, so that the managedsave is happening while the guest is still in the BIOS; but not sure.

c) Chatting to danpb and kashyap, can we just confirm that this test is:

loop {
  boot
  pause
  unpause
  suspend
  resume
  delete
}

(In reply to Dr. David Alan Gilbert from comment #13)
> a) The set of CPU flags in the qemu log is unusual - what host CPUs are
> you using?

This is the server: a Dell PowerEdge R620 with 24 CPUs x Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz.

> b) I suspect the 'boot instance' isn't waiting very long, so that the
> managedsave is happening while the guest is still in the BIOS; but not sure.

No, the test waits until the instance is in the UP state and then continues the flow.

> c) Chatting to danpb and kashyap, can we just confirm that this test is:
> loop {
>   boot
>   pause
>   unpause
>   suspend
>   resume
>   delete
> }

The full flow is:

loop {
  01.NOVA.GET.Images
  02.NOVA.GET.Flavors
  03.NEUTRON.POST.Create.Network
  04.NEUTRON.POST.Create.Subnet
  05.NOVA.POST.Boot.Server
  00.NOVA.GET.Server.Details
  06.NOVA.POST.Pause.Server
  07.NOVA.POST.Unpause.Server
  08.NOVA.POST.Suspend.Server
  09.NOVA.POST.Resume.Server
  10.NOVA.POST.Soft.Reboot.Server
  11.NOVA.POST.Hard.Reboot.Server
  12.NOVA.POST.Stop.Server
  13.NOVA.POST.Start.Server
  14.NOVA.POST.Create.Image
  15.NOVA.GET.Image.Id
  16.NOVA.DELETE.Image
  17.NOVA.DELETE.Server
  18.NEUTRON.DELETE.Network
  19.NOVA.GET.Server.Id
}

You can find more details at https://polarion.engineering.redhat.com/polarion/#/project/RHELOpenStackPlatform/wiki/Performance%20_%20Scale/RHOS%20Performance%20Test%20Plan
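For reference, the core of that flow can be approximated from a shell with the OpenStack client; this is only a rough sketch (the image, flavor and network names are hypothetical placeholders, and the image-snapshot and network create/delete steps are omitted):

    # Hypothetical names; substitute whatever exists in the test tenant
    IMAGE=cirros
    FLAVOR=m1.tiny
    NET_ID=$(openstack network show -f value -c id test-net)

    for i in $(seq 1 20); do
        VM="load-vm-$i"
        openstack server create --image "$IMAGE" --flavor "$FLAVOR" \
            --nic net-id="$NET_ID" --wait "$VM"
        openstack server pause "$VM"
        openstack server unpause "$VM"
        openstack server suspend "$VM"
        openstack server resume "$VM"
        openstack server reboot "$VM"          # soft reboot
        openstack server reboot --hard "$VM"   # hard reboot
        openstack server stop "$VM"
        openstack server start "$VM"
        openstack server delete "$VM"
    done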
(In reply to Yuri Obshansky from comment #14)
> > a) The set of CPU flags in the qemu log is unusual - what host CPUs are
> > you using?
>
> This is the server: a Dell PowerEdge R620 with 24 CPUs x Intel(R) Xeon(R)
> CPU E5-2620 0 @ 2.00GHz.
>
> > b) I suspect the 'boot instance' isn't waiting very long, so that the
> > managedsave is happening while the guest is still in the BIOS; but not sure.
>
> No, the test waits until the instance is in the UP state and then continues the flow.

OK, that's more worrying if it's while the guest is running.

> The full flow is:
> loop {
>   01.NOVA.GET.Images
>   [...]
>   19.NOVA.GET.Server.Id
> }

Thanks.

> You can find more details at
> https://polarion.engineering.redhat.com/polarion/#/project/RHELOpenStackPlatform/wiki/Performance%20_%20Scale/RHOS%20Performance%20Test%20Plan

I did look at that; it didn't make that much sense to me - but then again I just think at the qemu level.

Can I just confirm, this is running native on the host - no nesting or anything?

(In reply to Dr. David Alan Gilbert from comment #15)
> I did look at that; it didn't make that much sense to me - but then again I
> just think at the qemu level.

I think you are right.

> Can I just confirm, this is running native on the host - no nesting or
> anything?

Yes, OpenStack was deployed on bare-metal servers (no VMs).

Is there any chance you can test on our current 7.4 world (kernel/bios/seabios)? We've got a bunch of BIOS and other fixes around reboot that we know have fixed a few hangs and crashes, so it's certainly worth a try.

(In reply to Dr. David Alan Gilbert from comment #17)
> Is there any chance you can test on our current 7.4 world
> (kernel/bios/seabios)? We've got a bunch of BIOS and other fixes around
> reboot that we know have fixed a few hangs and crashes, so it's certainly
> worth a try.

Hi, do you mean RHEL 7.4? AFAIK, OpenStack supports only 7.3. Let me know what you suggest; I will gladly do it.

Yuri

Hi,
Dave and I checked whether this bug is reproducible on RHEL 7.4. The bug did not reproduce when the compute node had the new RHEL 7.4 packages installed:

kernel-3.10.0-663.el7.x86_64
qemu-kvm-rhev-2.9.0-3.el7.x86_64
seabios-bin-1.10.2-2.el7.noarch
seavgabios-bin-1.10.2-2.el7.noarch

No instance got stuck on the resume action. The performance test result is here:
http://yobshans.rdu.openstack.engineering.redhat.com/rhos-jmeter/result/2017-05-09-rhos-10-test-baseline-20x50-rhel-7.4/result.html

I attached 2 files:
- part of the nova log (instance-a055bcc7-3097-4d8b-9883-526319d3ec00.txt)
- qemu log (instance-000010b5.log)

Now, the questions are:
- The issue is fixed in future releases, but what about RHOS 10?
- I'm not sure, but I believe it will reproduce in RHOS 11 as well?
- Do we have any workaround?

Thank you
Yuri

Created attachment 1277444 [details]
part of nova log
Created attachment 1277445 [details]
qemu log
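As a sanity check, the package set used in that retest can be confirmed directly on the compute node (a trivial query; the expected output is the NVR list from the comment above):

    # Confirm the versions actually installed on the compute node
    rpm -q kernel qemu-kvm-rhev seabios-bin seavgabios-bin
    # Note: guests that are already running keep the QEMU binary and BIOS ROM
    # they were started with; only freshly started guests pick up new packages.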
(In reply to Yuri Obshansky from comment #22)
> [...]
> Now, the questions are:
> - The issue is fixed in future releases, but what about RHOS 10?
> - I'm not sure, but I believe it will reproduce in RHOS 11 as well?
> - Do we have any workaround?

I don't know the timing for RHOS 11, so I am not sure how it ties up with our 7.4 releases.

Workarounds: well, the problem is we don't actually know what fixed it! So I guess the next step would be to try reverting components to the 7.3 versions and seeing which one retriggers the bug. I suggest starting by reverting seabios and seavgabios and retesting. If that works, revert qemu-kvm-rhev; if that still works, then revert the kernel and we should be back to where we were! (You may need to look at what other dependencies were brought in during those updates, but it's most likely one of those 4 packages.)

The next step after that would be to run a kvm trace during your test and capture more details of the internal error.

Since we know it's one of seabios/kernel/qemu, I've flipped the component to qemu-kvm-rhev.

Dave

EAX=0000a0b5 EBX=ffffffff ECX=0002ffff EDX=000a0000
ESI=ffffffff EDI=ffffffff EBP=ffffffff ESP=000a8000
EIP=ffffffff EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT=     000f79b0 00000037
IDT=     000f79ee 00000000
CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=5b 66 5e 66 c3 ea 5b e0 00 f0 30 36 2f 32 33 2f 39 39 00 fc <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

That was from the two cases I saw yesterday, which are a bit different from the one above.

bonzini points out it may be SMM related, given the 'CS =a000 000a0000' in the original dump; and I'm suspicious the EDX and ESP in this dump point the same way. We did disable SMM in later BIOSes and there's a kernel SMM fix as well; so if it's going away with the 7.4 kernel/qemu/bios then that may well be the reason.

Please test with just the BIOS packages reverted and let us know whether this stays fixed. Once we know whether that does it, I think we should mark it as fixed, and if OpenStack wants it they can ask for it to be Z-streamed. (Hopefully the easiest fix would be disabling SMM in the BIOS.)
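Putting those suggestions together, a rough outline of the bisection on a compute node might look like this (the NVRs are the 7.3 builds named later in this bug; the exact downgrade mechanics depend on which repositories are enabled, and trace-cmd is just one way to record the kvm trace events):

    # 1) Revert only the BIOS packages to the RHEL 7.3 builds, then re-run the load test
    #    (guests pick up the downgraded BIOS only when they are started fresh)
    yum downgrade seabios-bin-1.9.1-5.el7_3.2 seavgabios-bin-1.9.1-5.el7_3.2

    # 2) If the bug stays away, downgrade qemu-kvm-rhev next; if it still stays away,
    #    downgrade the kernel, until the 7.3 component that retriggers the bug is found.

    # 3) While reproducing, record the kvm trace events for more detail on the internal error
    trace-cmd record -e kvm -o /tmp/kvm-trace.dat
    # ... run the test, stop trace-cmd with Ctrl-C, then inspect with:
    trace-cmd report /tmp/kvm-trace.dat | less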
I'm without hardware again; I'll re-test when I receive the servers. Sorry for the delay.

Yuri

Hi,
I updated the packages from

seabios-bin-1.9.1-5.el7_3.2.noarch
seavgabios-bin-1.9.1-5.el7_3.2.noarch

to

seabios-bin-1.10.2-3.el7.noarch
seavgabios-bin-1.10.2-3.el7.noarch

and re-ran the test. No stuck instances were detected during the load test. Test result -> http://yobshans.rdu.openstack.engineering.redhat.com/rhos-jmeter/result/2017-05-30-rhos-10-restapi-perf-test-20x50/

It looks like the issue is fixed in the new (RHEL 7.4) seabios packages. What are the next steps?

Yuri

Given that:
a) updating seabios to 7.4's seabios fixes it, and
b) the errors are consistent with an SMM error, and we disabled SMM in 7.4's seabios,

I'm marking as closed->nextrelease. Please ask if you want a backport.